Add MiniMax-M3 NVFP4 B200 single-node vLLM benchmark (EAGLE3 spec decode)#1933
Ankur-singh wants to merge 5 commits into
… decode New minimaxm3-fp4-b200-vllm-mtp config (fp4 vLLM aggregated on b200-dgxc with EAGLE3 speculative decoding, 3 draft tokens via Inferact/MiniMax-M3-EAGLE3). The benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve and routes prompts through the chat template. Target weights are pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 (added a minimaxm3-fp4 MODEL_PATH branch to launch_b200-dgxc.sh); the EAGLE3 draft is fetched next to the target weights.
VLLM_DIR=$(python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))")
for f in \
model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
model_executor/layers/quantization/modelopt.py \
model_executor/layers/quantization/utils/flashinfer_utils.py
do
curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
🟡 The patch step (lines 26-34) downloads three vLLM source files via curl and runs a Python import as a sanity check, but neither step has a fail-exit guard and the script does not set -e (nor does benchmark_lib.sh). If a download fails (network blip, GitHub raw outage, commit moved) or the import fails on a half-patched install, the script proceeds to vllm serve with broken/partially-patched code, producing a confusing serve-time error instead of a clean setup failure. Mirror the sibling minimaxm3_fp8_b200_mtp.sh:35 pattern: wrap the curl loop and the verify in || { echo "nvfp4 patch failed" >&2; exit 1; }.
Extended reasoning...
What the bug is
benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200_mtp.sh:26-34 overlays three files from vllm-project/vllm PR #46380 (commit 6c08558) onto the installed vLLM package, then runs a single-import sanity check:
for f in \
model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
model_executor/layers/quantization/modelopt.py \
model_executor/layers/quantization/utils/flashinfer_utils.py
do
curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
done
python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"
Neither the loop body nor the python3 verification has a || exit 1 guard, and there is no set -e either in this script or inherited from benchmark_lib.sh (the only set -e in benchmark_lib.sh is locally scoped inside one helper function, not at file level).
Code path that triggers it
- curl -fsSL returns non-zero on HTTP error (e.g. raw.githubusercontent.com 5xx, the commit getting rewritten, the file getting renamed, or a network blip).
- Without set -e, the failed curl invocation does not abort the loop; it just moves on to the next iteration, leaving a stale or missing file behind.
- The python3 -c line then imports exactly one of the three patched modules (TrtLlmNvFp4ExpertsModular from trtllm_nvfp4_moe). If that one import succeeds but modelopt.py or flashinfer_utils.py was not refreshed, the sanity check still prints [nvfp4-patch] OK.
- Even when the import itself fails, python3 exits non-zero but the script proceeds straight to vllm serve.
Why existing code does not prevent it
- set -e is not enabled at the script level (only set -x is set later, which is shell trace, not error-exit).
- benchmark_lib.sh does not set -e globally; the only set -e is local to a helper function (around line 1270).
- curl -f only affects curl's own exit code; it does not propagate to the shell unless combined with an exit guard.
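The gap is easy to reproduce in isolation. A minimal sketch, with false standing in for a failed curl; the loop items a b c and the variable names are placeholders, not anything from the benchmark script:

```shell
# Without set -e: the loop keeps going after each failed command.
out_no_e="$(bash -c 'for f in a b c; do false; echo "continued past $f"; done')"

# With set -e: the first failed command aborts the shell, so nothing is printed.
out_e="$(bash -c 'set -e; for f in a b c; do false; echo "continued past $f"; done' || true)"

printf 'without set -e:\n%s\n' "$out_no_e"
printf 'with set -e:\n%s\n' "$out_e"
```

The first run reports all three iterations; the second prints nothing after the header, which is the behavior the patch loop would need in order to stop at a failed download.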
Impact
A transient GH raw outage, a force-pushed commit, or a renamed file produces a confusing crash deep inside vllm serve (e.g. "unrecognised quantization config" or an ImportError from inside the engine) instead of a clean [nvfp4-patch] failed at script startup. Diagnosis costs an extra round-trip through engine logs for what is fundamentally a setup-time download failure.
How to fix
Mirror the sibling minimaxm3_fp8_b200_mtp.sh:35 pattern:
for f in \
model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
model_executor/layers/quantization/modelopt.py \
model_executor/layers/quantization/utils/flashinfer_utils.py
do
curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" \
|| { echo "[nvfp4-patch] curl failed for ${f}" >&2; exit 1; }
done
python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')" \
|| { echo "[nvfp4-patch] import verify failed" >&2; exit 1; }
Step-by-step proof
Concrete failure walk-through: suppose raw.githubusercontent.com returns a 503 for modelopt.py (transient outage):
- Iteration 1: curl for trtllm_nvfp4_moe.py succeeds.
- Iteration 2: curl for modelopt.py exits non-zero, but with no set -e and no || exit, the shell loop advances. The stale (pre-patch) modelopt.py from the base image remains on disk.
- Iteration 3: curl for flashinfer_utils.py succeeds.
- The verify line imports TrtLlmNvFp4ExpertsModular, which was successfully overwritten, and prints [nvfp4-patch] OK. The half-patched state goes undetected.
- vllm serve starts, hits the unpatched modelopt.py when loading the NVFP4 quant config, and crashes with an opaque engine error.
With the fix, step 2 immediately prints [nvfp4-patch] curl failed for model_executor/layers/quantization/modelopt.py and exits 1 — a one-line setup failure instead of an engine-internal crash. Severity is nit because the script still ultimately errors; the failure is just less localised and slower to diagnose.
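A stricter variant than the per-file guard, sketched under assumptions: stage every download in a temp directory and copy into the install only after all fetches succeed, so a mid-loop failure cannot leave a half-patched tree. overlay_all_or_nothing and fetch_one are hypothetical names for illustration; neither exists in the PR or in benchmark_lib.sh.

```shell
# Hypothetical helper: fetch one file. $1 = source URL, $2 = destination path.
fetch_one() {
    curl -fsSL "$1" -o "$2"
}

# Hypothetical helper: $1 = base URL, $2 = target dir, remaining args = relative paths.
# Nothing under the target dir is touched unless every fetch succeeds.
overlay_all_or_nothing() {
    local base="$1" target="$2"
    shift 2
    local stage f
    stage="$(mktemp -d)" || return 1
    # Phase 1: download everything into the staging directory.
    for f in "$@"; do
        mkdir -p "${stage}/$(dirname "$f")"
        if ! fetch_one "${base}/${f}" "${stage}/${f}"; then
            echo "[nvfp4-patch] fetch failed for ${f}" >&2
            rm -rf "$stage"
            return 1
        fi
    done
    # Phase 2: all fetches succeeded, so copy the staged files into place.
    for f in "$@"; do
        mkdir -p "${target}/$(dirname "$f")"
        cp "${stage}/${f}" "${target}/${f}" || { rm -rf "$stage"; return 1; }
    done
    rm -rf "$stage"
}
```

A call would pass the raw.githubusercontent.com base URL for commit 6c08558, "${VLLM_DIR}", and the three relative paths, followed by || exit 1. The import check would still be needed afterwards, ideally covering all three modules rather than one.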
- "Image vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-7a67223; benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; prompts routed through the chat template"
- "Target weights pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 (added minimaxm3-fp4 MODEL_PATH branch to launch_b200-dgxc.sh); EAGLE3 draft fetched next to the target weights; --block-size 128 (MSA), --language-model-only"
- "Sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-512"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
🟡 The new perf-changelog entry at perf-changelog.yaml:4194 has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — the XXX is a literal placeholder that was never substituted. Every other entry in this file uses the real PR number, so this will ship as a broken link in the released changelog. Trivial fix: replace XXX with 1933 before merge.
Extended reasoning...
What the bug is: The perf-changelog.yaml entry added by this PR (lines 4187-4194) ends with:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
The XXX is a literal three-character placeholder that should have been substituted with the actual PR number before the change was pushed. This PR is #1933 (per the PR metadata).
Why this is a defect: perf-changelog.yaml is a structured, machine- and human-readable record of perf changes. Every other entry in the file uses the actual numeric PR id, e.g. line 4185 ends with pull/1927, line 4176 with pull/1762, and earlier entries reference 1865, 1706, etc. Anyone (or anything) consuming this file (release notes generators, the GitHub UI when the markdown is rendered, humans clicking the link) will land on https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which is not a valid PR. GitHub will return a 404.
Why existing code doesn't prevent it: There is no schema validation on pr-link that enforces a numeric trailing component (the value is just a string), and the PR was opened from a branch where the author hand-edited the changelog entry from a template and forgot to fill in the placeholder. A passing CI run doesn't catch this kind of metadata typo.
Impact: Once merged, the changelog will contain a permanently broken link. The fix is more disruptive after merge (requires a follow-up PR) than now (a one-line change to the diff).
Step-by-step proof:
1. Open perf-changelog.yaml after this PR merges.
2. Scroll to the bottom (lines 4187-4194), the new minimaxm3-fp4-b200-vllm-mtp entry.
3. Read pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
4. Click or curl that URL: GitHub returns 404 because XXX is not a valid PR number.
5. Compare against the immediately preceding entry on line 4185, pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1927, which resolves correctly.
How to fix: In the diff, change line 4194 from:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
to:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1933
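Since nothing validates pr-link today, a pre-merge check could catch leftover placeholders. A minimal sketch, assuming each pr-link sits on its own line and ends in /pull/<number>; check_pr_links is a hypothetical helper, not an existing script in the repo:

```shell
# Hypothetical helper: $1 = path to the changelog file.
# Fails if any pr-link line does not end in /pull/<digits>.
check_pr_links() {
    local bad
    # First grep selects pr-link lines (with line numbers); second keeps the malformed ones.
    bad="$(grep -nE '^[[:space:]]*pr-link:' "$1" | grep -vE '/pull/[0-9]+[[:space:]]*$' || true)"
    if [ -n "$bad" ]; then
        echo "placeholder or malformed pr-link:" >&2
        echo "$bad" >&2
        return 1
    fi
}
```

Run as check_pr_links perf-changelog.yaml || exit 1 in CI, this would flag the pull/XXX line while passing entries such as pull/1927.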
…lm-mtp # Conflicts: # perf-changelog.yaml
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28189601386
functionstackx
left a comment
- fix patchwork as discussed in slack
- missing vllm recipes
The vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 image bakes in MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380), so the EAGLE3 benchmark script no longer overwrites vLLM files at runtime.
…lm-mtp # Conflicts: # perf-changelog.yaml
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28197848860
Adds the minimaxm3-fp4-b200-vllm-mtp config: MiniMax-M3 NVFP4 (nvidia/MiniMax-M3-NVFP4) single-node aggregated vLLM on B200 (runner: b200-dgxc) with EAGLE3 speculative decoding (spec-decoding: mtp, 3 draft tokens via Inferact/MiniMax-M3-EAGLE3).
- nvidia-master.yaml entry (fp4 / vllm / runner b200-dgxc), every search-space row spec-decoding: mtp; sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-512.
- benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200_mtp.sh: overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; --speculative-config EAGLE3; --block-size 128 (MSA), --language-model-only; prompts routed through the chat template.
- /scratch/fsw/models/MiniMax-M3-NVFP4: added a minimaxm3 && fp4 branch to launch_b200-dgxc.sh. The EAGLE3 draft (not staged) is fetched next to the target weights, as in the fp8 b200 MTP recipe.
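For orientation, the flags named above might combine roughly as follows. This is a sketch, not the PR's script: the JSON keys for --speculative-config follow vLLM's documented EAGLE3 usage, while the draft location (the Hugging Face id rather than the fetched local copy), the TP value, and the variable names are assumptions. The command is only printed, not run.

```shell
MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"   # pre-staged target weights (from the PR)
DRAFT_MODEL="Inferact/MiniMax-M3-EAGLE3"            # assumption: the script may point at a local copy instead
TP=8                                                # assumption: one of the swept values (4 or 8)
NUM_DRAFT_TOKENS=3                                  # 3 draft tokens (from the PR)

# Build the JSON payload for --speculative-config.
SPEC_CONFIG="$(printf '{"method": "eagle3", "model": "%s", "num_speculative_tokens": %d}' \
    "$DRAFT_MODEL" "$NUM_DRAFT_TOKENS")"

# Print the assembled command for inspection instead of launching a server.
echo vllm serve "$MODEL_PATH" \
    --tensor-parallel-size "$TP" \
    --block-size 128 \
    --language-model-only \
    --speculative-config "'${SPEC_CONFIG}'"
```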